I. Introduction


The National Basketball Association (NBA) is a men’s professional basketball league in North America composed of 30 teams. Thanks largely to the efforts of David Stern, the fourth NBA Commissioner, the NBA grew from an unknown commodity outside the United States into a sport of the modern world. Beyond its business model and fame, the game itself has gone through great changes during the last 15 years. 1 2

Today we can analyze teams’ and players’ performance from different angles using their game data instead of simply watching video recordings, which gives us more ways to learn about and enjoy basketball.

In this report, we use data to find out how the NBA has changed over the past 15 years. We are curious about how teams’ overall strategy has changed and how players have adapted to these changes.



II. Data source


Collection

Chao Yin is mainly responsible for collecting team and player game statistics, while Zeyu Yang is responsible for players’ biographical information.

Our data is collected from Basketball-Reference and Stats NBA.

  • Basketball Reference is a site providing both basic and sabermetric statistics and resources for basketball fans using official NBA data.

  • Stats NBA is the home of NBA Advanced Stats and provides official NBA Statistics and advanced analytics.

Data on Basketball-Reference is stored in XML, so we can extract it directly using the XML and RCurl packages. However, some tables on this site are commented out and can only be downloaded manually in CSV form, so we chose Stats NBA for the other data. Extracting data tables from Stats NBA is a bit harder than from Basketball-Reference because they are stored in JSON files. We use statsnbaR, which provides utility functions to download data from the API endpoints of Stats NBA. We got teams from Basketball-Reference and players from Stats NBA.
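To illustrate why the JSON route takes more work, the sketch below flattens one result set shaped the way Stats NBA responses are generally laid out: a `headers` vector plus a `rowSet` list of rows. The helper name `result_set_to_df` and the sample values are hypothetical, and no network request is made. In this report the step is handled by statsnbaR.

```r
# Hypothetical helper: turn one parsed result set (headers + rowSet)
# into a data frame. The input mimics an already-parsed JSON response.
result_set_to_df <- function(result_set) {
  headers <- unlist(result_set$headers)
  rows <- lapply(result_set$rowSet, function(r) {
    # A JSON null arrives as NULL in R; replace it with NA so that
    # every row keeps the same number of fields
    r <- lapply(r, function(x) if (is.null(x)) NA else x)
    as.data.frame(stats::setNames(r, headers), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up sample standing in for a parsed response
sample_set <- list(
  headers = list("PLAYER_NAME", "GP", "PTS"),
  rowSet  = list(
    list("Player A", 82, 20.5),
    list("Player B", 60, NULL)
  )
)

players_raw <- result_set_to_df(sample_set)
print(players_raw)
```

Binding row by row is slow for large tables, but it keeps the sketch short and makes the handling of missing fields explicit.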

Datasets and variables

Each players dataset contains the regular-season information of all players in one season.


General data covers players’ basic performance, including:

  • Profile information: Name, Team, Age, Games Played, Minutes Played, etc.

  • Shooting performance on 2-pointers, 3-pointers, and free throws: Field Goals Made, Field Goals Attempted, Field Goal Percentage, etc.

  • Basic stats per game: Rebounds, Assists, Steals, Blocks, Points, Turnovers, Personal Fouls, etc.


Advanced data measures and analyzes players’ abilities in specific areas:

  • Overall ratings: Offensive Rating, Defensive Rating, Net Rating, Player Impact Estimate, Usage Percentage, etc.

  • Passing/Assist ability: Assist Percentage, Assist to Turnover Ratio, Assist Ratio

  • Rebound ability: Offensive Rebound Percentage, Defensive Rebound Percentage, Rebound Percentage

  • Shooting ability: Effective Field Goal Percentage, True Shooting Percentage


The bio dataset contains players’ biographical data:

  • The year the player started playing in the NBA and the year he retired

  • Height and weight data

  • Birthdate

  • College attended


The teams datasets contain information similar to the players datasets, but for each team in the league. In addition, the teams data can be split in several ways to measure teams’ performance from different angles:

  • Location helps measure teams’ gaming performance at home or on the road respectively

  • Wins-Losses tells how the team played when they won or lost the game

  • Month and Pre/Post All Stars give teams’ performance changes over time periods

  • Days Rest tests teams’ abilities to handle tough schedules

Issues/Problems

  • Teams in the NBA have changed over these 15 years. Three teams changed their locations and names, so the set of teams is not necessarily the same each year.

  • Players can be traded and signed during the season, so some players have more records than others in these datasets.

  • Height in the bio dataset is saved as a character string, such as “6-8”, which we need to convert to a numeric value.

  • All data are saved as factors, which we need to convert to numeric or character.
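The last two conversions can be sketched in base R as follows. The function names are hypothetical, and the conversion to centimetres is an assumption based on the height values (in cm) shown in the Players_bio summary later in this report.

```r
# Convert a "feet-inches" string such as "6-8" to centimetres.
height_to_cm <- function(h) {
  parts <- strsplit(as.character(h), "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)  # malformed or missing value
    (as.numeric(p[1]) * 12 + as.numeric(p[2])) * 2.54
  }, numeric(1))
}

# Convert a factor that holds numbers to numeric. Going through
# as.character() matters: as.numeric() applied directly to a factor
# returns the internal level codes, not the printed values.
factor_to_numeric <- function(f) as.numeric(as.character(f))

height_to_cm(c("6-8", "5-5"))              # 203.2 165.1
factor_to_numeric(factor(c("19.8", "3")))  # 19.8 3.0
```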



III. Data cleaning


After we got all the raw data in data/raw, we combined them into four datasets: Team_splits, Team_shooting, Player, and Players_bio, which are stored in data/tidy.

For the players’ data, we first remove empty rows and columns and convert the variables to numeric or character according to their content. Because more and more players can play more than one position today, we group the players into three kinds (Guards, Wings, and Bigs) instead of using their original positions. Finally, we combine the players data of all 15 years into Player.
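The report does not state the exact mapping behind the three groups. A minimal sketch, assuming point and shooting guards become Guards, small forwards become Wings, and power forwards and centers become Bigs (an assumption that is consistent with the group shares in the summary, but not confirmed by the source), could look like this:

```r
# Assumed mapping from the five traditional positions to three groups
pos_groups <- c(PG = "Guards", SG = "Guards",
                SF = "Wings",
                PF = "Bigs",   C  = "Bigs")

group_position <- function(pos) {
  # For hybrid labels such as "SG-PG", use the first listed position
  primary <- sub("-.*$", "", as.character(pos))
  factor(unname(pos_groups[primary]),
         levels = c("Guards", "Wings", "Bigs"))
}

group_position(c("PG", "SF", "C", "SG-PG"))
```

Unknown labels map to NA, which makes unexpected position codes easy to spot after grouping.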

* Scroll down the table to see more details

library(summarytools)  # provides dfSummary()
print(dfSummary(Player,
                headings = FALSE,
                plain.ascii = FALSE,
                valid.col = FALSE,
                graph.magnif = 0.75,
                style = "grid",
                max.distinct.values = 5,
                varnumbers = FALSE),
      max.tbl.height = 500,method='render')
Variable Stats / Values Freqs (% of Valid) Graph Missing
Player [character] 1. Kyle Korver: 19 (0.2%); 2. Devin Harris: 18 (0.2%); 3. Pau Gasol: 18 (0.2%); 4. Trevor Ariza: 18 (0.2%); 5. Vince Carter: 18 (0.2%); [ 1640 others ]: 8505 (98.9%) 0 (0%)
Pos [factor] 1. Guards: 3501 (40.7%); 2. Wings: 1584 (18.4%); 3. Bigs: 3511 (40.8%) 0 (0%)
Age [numeric] Mean (sd) : 26.6 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 6 (0.2) 26 distinct values 0 (0%)
Tm [character] 1. HOU: 315 (3.7%); 2. CLE: 314 (3.7%); 3. MEM: 306 (3.6%); 4. NYK: 303 (3.5%); 5. LAC: 296 (3.4%); [ 30 others ]: 7062 (82.2%) 0 (0%)
G [numeric] Mean (sd) : 46.6 (26.6) min < med < max: 1 < 51 < 82 IQR (CV) : 49 (0.6) 82 distinct values 0 (0%)
GS [numeric] Mean (sd) : 22.6 (27.9) min < med < max: 0 < 7 < 82 IQR (CV) : 41.2 (1.2) 83 distinct values 0 (0%)
MP [numeric] Mean (sd) : 19.8 (10) min < med < max: 0 < 19.2 < 43.1 IQR (CV) : 16.4 (0.5) 410 distinct values 0 (0%)
FG [numeric] Mean (sd) : 3 (2.1) min < med < max: 0 < 2.5 < 12.2 IQR (CV) : 2.9 (0.7) 111 distinct values 0 (0%)
FGA [numeric] Mean (sd) : 6.7 (4.5) min < med < max: 0 < 5.6 < 27.2 IQR (CV) : 6.3 (0.7) 228 distinct values 0 (0%)
FG% [numeric] Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) 452 distinct values 52 (0.6%)
3P [numeric] Mean (sd) : 0.6 (0.7) min < med < max: 0 < 0.3 < 5.1 IQR (CV) : 1 (1.2) 44 distinct values 0 (0%)
3PA [numeric] Mean (sd) : 1.7 (1.8) min < med < max: 0 < 1.1 < 13.2 IQR (CV) : 2.7 (1.1) 95 distinct values 0 (0%)
3P% [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) 376 distinct values 1335 (15.53%)
2P [numeric] Mean (sd) : 2.4 (1.9) min < med < max: 0 < 1.9 < 10.3 IQR (CV) : 2.4 (0.8) 99 distinct values 0 (0%)
2PA [numeric] Mean (sd) : 5 (3.7) min < med < max: 0 < 3.9 < 22.2 IQR (CV) : 4.8 (0.7) 198 distinct values 0 (0%)
2P% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) 444 distinct values 94 (1.09%)
eFG% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 468 distinct values 52 (0.6%)
FT [numeric] Mean (sd) : 1.4 (1.4) min < med < max: 0 < 1 < 10.3 IQR (CV) : 1.4 (1) 92 distinct values 0 (0%)
FTA [numeric] Mean (sd) : 1.9 (1.7) min < med < max: 0 < 1.4 < 11.7 IQR (CV) : 1.9 (0.9) 112 distinct values 0 (0%)
FT% [numeric] Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) 576 distinct values 438 (5.1%)
ORB [numeric] Mean (sd) : 0.9 (0.8) min < med < max: 0 < 0.7 < 6 IQR (CV) : 1 (0.9) 54 distinct values 0 (0%)
DRB [numeric] Mean (sd) : 2.6 (1.8) min < med < max: 0 < 2.2 < 12 IQR (CV) : 2.1 (0.7) 111 distinct values 0 (0%)
TRB [numeric] Mean (sd) : 3.5 (2.5) min < med < max: 0 < 2.9 < 18 IQR (CV) : 2.9 (0.7) 148 distinct values 0 (0%)
AST [numeric] Mean (sd) : 1.8 (1.8) min < med < max: 0 < 1.2 < 12.8 IQR (CV) : 1.8 (1) 113 distinct values 0 (0%)
STL [numeric] Mean (sd) : 0.6 (0.4) min < med < max: 0 < 0.5 < 2.9 IQR (CV) : 0.6 (0.7) 30 distinct values 0 (0%)
BLK [numeric] Mean (sd) : 0.4 (0.5) min < med < max: 0 < 0.2 < 6 IQR (CV) : 0.4 (1.2) 39 distinct values 0 (0%)
TOV [numeric] Mean (sd) : 1.1 (0.8) min < med < max: 0 < 1 < 5.7 IQR (CV) : 0.9 (0.7) 51 distinct values 0 (0%)
PF [numeric] Mean (sd) : 1.8 (0.8) min < med < max: 0 < 1.8 < 6 IQR (CV) : 1.2 (0.5) 46 distinct values 0 (0%)
PTS [numeric] Mean (sd) : 8 (5.9) min < med < max: 0 < 6.5 < 36.1 IQR (CV) : 7.9 (0.7) 301 distinct values 0 (0%)
PER [numeric] Mean (sd) : 12.7 (6.1) min < med < max: -54.4 < 12.6 < 133.8 IQR (CV) : 6.1 (0.5) 412 distinct values 3 (0.03%)
TS% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 481 distinct values 25 (0.29%)
3PAr [numeric] Mean (sd) : 0.2 (0.2) min < med < max: 0 < 0.2 < 1 IQR (CV) : 0.4 (0.9) 784 distinct values 26 (0.3%)
FTr [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 6 IQR (CV) : 0.2 (0.8) 778 distinct values 26 (0.3%)
ORB% [numeric] Mean (sd) : 5.5 (4.8) min < med < max: 0 < 4.1 < 100 IQR (CV) : 6.3 (0.9) 222 distinct values 3 (0.03%)
DRB% [numeric] Mean (sd) : 14.5 (6.5) min < med < max: 0 < 13.5 < 100 IQR (CV) : 8.5 (0.4) 354 distinct values 3 (0.03%)
TRB% [numeric] Mean (sd) : 10 (5) min < med < max: 0 < 9 < 86.4 IQR (CV) : 7.1 (0.5) 265 distinct values 3 (0.03%)
AST% [numeric] Mean (sd) : 12.7 (9.2) min < med < max: 0 < 9.8 < 78.5 IQR (CV) : 10.9 (0.7) 470 distinct values 3 (0.03%)
STL% [numeric] Mean (sd) : 1.6 (0.9) min < med < max: 0 < 1.5 < 12.5 IQR (CV) : 0.8 (0.6) 80 distinct values 3 (0.03%)
BLK% [numeric] Mean (sd) : 1.6 (1.7) min < med < max: 0 < 1 < 26.3 IQR (CV) : 1.7 (1.1) 109 distinct values 3 (0.03%)
TOV% [numeric] Mean (sd) : 13.9 (6.2) min < med < max: 0 < 13.2 < 100 IQR (CV) : 5.8 (0.4) 341 distinct values 21 (0.24%)
USG% [numeric] Mean (sd) : 18.6 (5.3) min < med < max: 0 < 18.2 < 53.7 IQR (CV) : 6.8 (0.3) 334 distinct values 3 (0.03%)
OWS [numeric] Mean (sd) : 1.3 (2) min < med < max: -3.3 < 0.6 < 14.8 IQR (CV) : 2 (1.6) 156 distinct values 0 (0%)
DWS [numeric] Mean (sd) : 1.2 (1.2) min < med < max: -0.6 < 0.9 < 9.1 IQR (CV) : 1.5 (1) 80 distinct values 0 (0%)
WS [numeric] Mean (sd) : 2.5 (2.9) min < med < max: -2.1 < 1.6 < 20.3 IQR (CV) : 3.5 (1.2) 184 distinct values 0 (0%)
WS/48 [numeric] Mean (sd) : 0.1 (0.1) min < med < max: -1.3 < 0.1 < 2.7 IQR (CV) : 0.1 (1.4) 557 distinct values 3 (0.03%)
OBPM [numeric] Mean (sd) : -1.6 (3.6) min < med < max: -46.4 < -1.4 < 68.6 IQR (CV) : 3.4 (-2.2) 283 distinct values 0 (0%)
DBPM [numeric] Mean (sd) : -0.4 (2.1) min < med < max: -23.1 < -0.4 < 17.1 IQR (CV) : 2.4 (-4.8) 185 distinct values 0 (0%)
BPM [numeric] Mean (sd) : -2 (4.3) min < med < max: -59 < -1.7 < 54.4 IQR (CV) : 4.3 (-2.1) 334 distinct values 0 (0%)
VORP [numeric] Mean (sd) : 0.6 (1.3) min < med < max: -2.2 < 0 < 12.4 IQR (CV) : 1.1 (2.3) 112 distinct values 0 (0%)
year [integer] Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)

For Players_bio data, we join players’ data and biographical data and turn the variables into numerics and characters according to their content.

* Scroll down the table to see more details

print(dfSummary(Players_bio,
                headings = FALSE,
                plain.ascii = FALSE,
                valid.col = FALSE,
                graph.magnif = 0.75,
                style = "grid",
                max.distinct.values = 5,
                varnumbers = FALSE),
      max.tbl.height = 500,method='render')
Variable Stats / Values Freqs (% of Valid) Graph Missing
Rk [numeric] Mean (sd) : 239.6 (137.8) min < med < max: 1 < 239 < 540 IQR (CV) : 238 (0.6) 540 distinct values 0 (0%)
Player [character] 1. Mike James: 42 (0.4%); 2. Mike Dunleavy: 36 (0.4%); 3. Chris Johnson: 28 (0.3%); 4. David Lee: 28 (0.3%); 5. Corey Brewer: 22 (0.2%); [ 1641 others ]: 9566 (98.4%) 0 (0%)
Pos [character] 1. SG: 1984 (20.4%); 2. PF: 1973 (20.3%); 3. PG: 1945 (20.0%); 4. C: 1886 (19.4%); 5. SF: 1778 (18.3%); [ 10 others ]: 156 (1.6%) 0 (0%)
Age [numeric] Mean (sd) : 26.6 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 7 (0.2) 26 distinct values 0 (0%)
Tm [character] 1. TOT: 986 (10.1%); 2. HOU: 321 (3.3%); 3. CLE: 319 (3.3%); 4. NYK: 313 (3.2%); 5. MEM: 309 (3.2%); [ 31 others ]: 7474 (76.9%) 0 (0%)
G [numeric] Mean (sd) : 46.6 (26.3) min < med < max: 1 < 51 < 85 IQR (CV) : 49 (0.6) 85 distinct values 0 (0%)
GS [numeric] Mean (sd) : 21.9 (27.4) min < med < max: 0 < 7 < 83 IQR (CV) : 40 (1.2) 84 distinct values 0 (0%)
MP [numeric] Mean (sd) : 1078 (877.4) min < med < max: 0 < 887 < 3424 IQR (CV) : 1483 (0.8) 2828 distinct values 0 (0%)
FG [numeric] Mean (sd) : 166.1 (164.4) min < med < max: 0 < 114 < 978 IQR (CV) : 225 (1) 727 distinct values 0 (0%)
FGA [numeric] Mean (sd) : 366.9 (352.8) min < med < max: 0 < 260 < 2173 IQR (CV) : 489 (1) 1370 distinct values 0 (0%)
FG% [numeric] Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) 458 distinct values 53 (0.55%)
3P [numeric] Mean (sd) : 33.1 (46.4) min < med < max: 0 < 10 < 402 IQR (CV) : 52 (1.4) 249 distinct values 0 (0%)
3PA [numeric] Mean (sd) : 93 (122.9) min < med < max: 0 < 34 < 1028 IQR (CV) : 146 (1.3) 552 distinct values 0 (0%)
3P% [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) 380 distinct values 1490 (15.33%)
2P [numeric] Mean (sd) : 133 (140.6) min < med < max: 0 < 85 < 798 IQR (CV) : 174 (1.1) 644 distinct values 0 (0%)
2PA [numeric] Mean (sd) : 273.9 (280.7) min < med < max: 0 < 182 < 1655 IQR (CV) : 350 (1) 1140 distinct values 0 (0%)
2P% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) 446 distinct values 98 (1.01%)
eFG% [numeric] Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) 473 distinct values 53 (0.55%)
FT [numeric] Mean (sd) : 80.1 (98.8) min < med < max: 0 < 44 < 756 IQR (CV) : 99 (1.2) 515 distinct values 0 (0%)
FTA [numeric] Mean (sd) : 105.7 (125.1) min < med < max: 0 < 61 < 916 IQR (CV) : 129 (1.2) 615 distinct values 0 (0%)
FT% [numeric] Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) 582 distinct values 475 (4.89%)
ORB [numeric] Mean (sd) : 48.4 (57.1) min < med < max: 0 < 28 < 440 IQR (CV) : 56 (1.2) 310 distinct values 0 (0%)
DRB [numeric] Mean (sd) : 139.5 (137.5) min < med < max: 0 < 102 < 894 IQR (CV) : 174 (1) 650 distinct values 0 (0%)
TRB [numeric] Mean (sd) : 187.8 (188.3) min < med < max: 0 < 133 < 1247 IQR (CV) : 228 (1) 837 distinct values 0 (0%)
AST [numeric] Mean (sd) : 97.4 (123.3) min < med < max: 0 < 53 < 925 IQR (CV) : 117 (1.3) 610 distinct values 0 (0%)
STL [numeric] Mean (sd) : 33.6 (32.5) min < med < max: 0 < 24 < 217 IQR (CV) : 44 (1) 179 distinct values 0 (0%)
BLK [numeric] Mean (sd) : 21.2 (30.4) min < med < max: 0 < 10 < 307 IQR (CV) : 23 (1.4) 208 distinct values 0 (0%)
TOV [numeric] Mean (sd) : 61.4 (59.5) min < med < max: 0 < 44 < 464 IQR (CV) : 78 (1) 304 distinct values 0 (0%)
PF [numeric] Mean (sd) : 93.3 (71.2) min < med < max: 0 < 83 < 332 IQR (CV) : 117 (0.8) 304 distinct values 0 (0%)
PTS [numeric] Mean (sd) : 445.5 (448.9) min < med < max: 0 < 303 < 2832 IQR (CV) : 600 (1) 1656 distinct values 0 (0%)
Year [numeric] Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)
year_start [integer] Mean (sd) : 2006.5 (6.7) min < med < max: 1952 < 2007 < 2018 IQR (CV) : 10 (0) 44 distinct values 275 (2.83%)
year_end [integer] Mean (sd) : 2014.3 (5) min < med < max: 1958 < 2016 < 2018 IQR (CV) : 6 (0) 31 distinct values 275 (2.83%)
position [character] 1. G: 3410 (36.1%); 2. F: 2746 (29.1%); 3. C: 1064 (11.3%); 4. F-C: 795 (8.4%); 5. G-F: 745 (7.9%); [ 2 others ]: 687 (7.3%) 275 (2.83%)
height [numeric] Mean (sd) : 200.6 (9.1) min < med < max: 165.1 < 200.7 < 228.6 IQR (CV) : 15.2 (0) 22 distinct values 275 (2.83%)
weight [integer] Mean (sd) : 219.8 (26.9) min < med < max: 135 < 220 < 360 IQR (CV) : 40 (0.1) 120 distinct values 275 (2.83%)
birth_date [character] 1. June 26, 1984: 45 (0.5%); 2. June 1, 1985: 35 (0.4%); 3. March 25, 1986: 29 (0.3%); 4. May 19, 1976: 28 (0.3%); 5. August 17, 1986: 27 (0.3%); [ 1411 others ]: 9283 (98.3%) 275 (2.83%)
college [character] 1. (empty): 1582 (16.7%); 2. University of Kentucky: 326 (3.5%); 3. Duke University: 287 (3.0%); 4. University of North Carol: 268 (2.8%); 5. University of California,: 242 (2.6%); [ 229 others ]: 6742 (71.4%) 275 (2.83%)

For the teams’ data, we split it into two datasets: Team_splits and Team_shooting.

Team_splits contains all the ‘per game’ stats for each of the 30 teams in every season. We filter by ‘Location’ because every team plays 41 home games and 41 road games each year, so we can simply take the mean of the two splits to get the seasonal average. We changed the format, removed the ranking variables, combined the basic data with the advanced data, and put all 15 years of data into this one dataset.
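A minimal sketch of the averaging step, using made-up numbers and hypothetical column names: with one Home row and one Road row per team and season, the plain mean of the two rows equals the per-game seasonal average only because both splits cover the same number of games (41 each).

```r
# Made-up split rows: one Home and one Road row per team and season
splits <- data.frame(
  team     = c("Team A", "Team A", "Team B", "Team B"),
  year     = c(2019, 2019, 2019, 2019),
  location = c("Home", "Road", "Home", "Road"),
  pts      = c(112, 108, 104, 100),
  pace     = c(101, 99, 98, 96)
)

# Average the two split rows to get one seasonal row per team
season_avg <- aggregate(cbind(pts, pace) ~ team + year,
                        data = splits, FUN = mean)
print(season_avg)
```

If the two splits covered different numbers of games, a games-weighted mean would be needed instead.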

* Scroll down the table to see more details

print(dfSummary(Team_splits,
                headings = FALSE,
                plain.ascii = FALSE,
                valid.col = FALSE,
                graph.magnif = 0.75,
                style = "grid",
                max.distinct.values = 5,
                varnumbers = FALSE),
      max.tbl.height = 500,method='render')
Variable Stats / Values Freqs (% of Valid) Graph Missing
team [character] 1. Atlanta Hawks: 16 (3.3%); 2. Boston Celtics: 16 (3.3%); 3. Chicago Bulls: 16 (3.3%); 4. Cleveland Cavaliers: 16 (3.3%); 5. Dallas Mavericks: 16 (3.3%); [ 31 others ]: 399 (83.3%) 0 (0%)
pctWins [numeric] Mean (sd) : 0.5 (0.2) min < med < max: 0.1 < 0.5 < 0.9 IQR (CV) : 0.2 (0.3) 115 distinct values 0 (0%)
fgm [numeric] Mean (sd) : 37.5 (2.1) min < med < max: 32.4 < 37.3 < 44 IQR (CV) : 2.7 (0.1) 168 distinct values 0 (0%)
fga [numeric] Mean (sd) : 82.5 (3.6) min < med < max: 74.2 < 82.2 < 94 IQR (CV) : 5.1 (0) 220 distinct values 0 (0%)
pctFG [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.5 IQR (CV) : 0 (0) 125 distinct values 0 (0%)
fg3m [numeric] Mean (sd) : 7.4 (2.3) min < med < max: 2.8 < 7 < 16.1 IQR (CV) : 3 (0.3) 164 distinct values 0 (0%)
fg3a [numeric] Mean (sd) : 20.7 (6.1) min < med < max: 8.2 < 19.5 < 45.3 IQR (CV) : 8.3 (0.3) 294 distinct values 0 (0%)
pctFG3 [numeric] Mean (sd) : 0.4 (0) min < med < max: 0.3 < 0.4 < 0.4 IQR (CV) : 0 (0.1) 478 distinct values 0 (0%)
pctFT [numeric] Mean (sd) : 0.8 (0) min < med < max: 0.7 < 0.8 < 0.8 IQR (CV) : 0 (0) 206 distinct values 0 (0%)
fg2m [numeric] Mean (sd) : 30.1 (1.9) min < med < max: 23.1 < 30.2 < 35.2 IQR (CV) : 2.4 (0.1) 151 distinct values 0 (0%)
fg2a [numeric] Mean (sd) : 61.8 (4.6) min < med < max: 41.9 < 62.1 < 74.3 IQR (CV) : 6.1 (0.1) 253 distinct values 0 (0%)
pctFG2 [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) 479 distinct values 0 (0%)
ftm [numeric] Mean (sd) : 18.2 (2) min < med < max: 12.2 < 18.1 < 24.1 IQR (CV) : 2.6 (0.1) 153 distinct values 0 (0%)
fta [numeric] Mean (sd) : 24 (2.6) min < med < max: 16.6 < 23.9 < 31.6 IQR (CV) : 3.3 (0.1) 196 distinct values 0 (0%)
oreb [numeric] Mean (sd) : 11 (1.3) min < med < max: 7.6 < 10.9 < 14.6 IQR (CV) : 1.7 (0.1) 113 distinct values 0 (0%)
dreb [numeric] Mean (sd) : 31.5 (2.1) min < med < max: 26.9 < 31.2 < 40.5 IQR (CV) : 3 (0.1) 159 distinct values 0 (0%)
treb [numeric] Mean (sd) : 42.4 (2) min < med < max: 36.8 < 42.2 < 49.7 IQR (CV) : 2.7 (0) 154 distinct values 0 (0%)
ast [numeric] Mean (sd) : 21.9 (2) min < med < max: 17.4 < 21.6 < 30.4 IQR (CV) : 2.6 (0.1) 157 distinct values 0 (0%)
tov [numeric] Mean (sd) : 14.4 (1.1) min < med < max: 11.2 < 14.4 < 17.7 IQR (CV) : 1.4 (0.1) 106 distinct values 0 (0%)
stl [numeric] Mean (sd) : 7.5 (0.9) min < med < max: 5.5 < 7.5 < 10 IQR (CV) : 1.1 (0.1) 81 distinct values 0 (0%)
blk [numeric] Mean (sd) : 4.9 (0.8) min < med < max: 2.4 < 4.8 < 8.2 IQR (CV) : 1 (0.2) 78 distinct values 0 (0%)
blka [numeric] Mean (sd) : 4.9 (0.7) min < med < max: 3 < 4.9 < 6.9 IQR (CV) : 0.9 (0.1) 71 distinct values 0 (0%)
pf [numeric] Mean (sd) : 20.9 (1.7) min < med < max: 16.6 < 20.8 < 26.7 IQR (CV) : 2.4 (0.1) 137 distinct values 0 (0%)
pts [numeric] Mean (sd) : 100.5 (5.9) min < med < max: 85.5 < 99.7 < 118.2 IQR (CV) : 7.6 (0.1) 296 distinct values 0 (0%)
pfd [numeric] Mean (sd) : 19.5 (5.1) min < med < max: 0 < 20.4 < 25.6 IQR (CV) : 2.2 (0.3) 119 distinct values 32 (6.68%)
pctAST [numeric] Mean (sd) : 0.6 (0) min < med < max: 0.5 < 0.6 < 0.7 IQR (CV) : 0.1 (0.1) 237 distinct values 0 (0%)
pctOREB [numeric] Mean (sd) : 0.3 (0) min < med < max: 0.2 < 0.3 < 0.4 IQR (CV) : 0 (0.1) 191 distinct values 0 (0%)
pctDREB [numeric] Mean (sd) : 0.7 (0) min < med < max: 0.7 < 0.7 < 0.8 IQR (CV) : 0 (0) 174 distinct values 0 (0%)
pctTREB [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.5 IQR (CV) : 0 (0) 119 distinct values 0 (0%)
pctTOVTeam [numeric] Mean (sd) : 0.2 (0) min < med < max: 0.1 < 0.2 < 0.2 IQR (CV) : 0 (0.1) 112 distinct values 0 (0%)
pctEFG [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) 172 distinct values 0 (0%)
pctTS [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.6 IQR (CV) : 0 (0) 151 distinct values 0 (0%)
ortgE [numeric] Mean (sd) : 104.1 (3.7) min < med < max: 92.3 < 103.9 < 113.9 IQR (CV) : 5.4 (0) 227 distinct values 0 (0%)
ortg [numeric] Mean (sd) : 105.7 (3.7) min < med < max: 94.4 < 105.3 < 114.9 IQR (CV) : 5.1 (0) 232 distinct values 0 (0%)
drtgE [numeric] Mean (sd) : 104.1 (3.6) min < med < max: 91.6 < 104.2 < 115.1 IQR (CV) : 5.1 (0) 229 distinct values 0 (0%)
drtg [numeric] Mean (sd) : 105.7 (3.5) min < med < max: 93.1 < 105.8 < 116.8 IQR (CV) : 4.9 (0) 223 distinct values 0 (0%)
netrtgE [numeric] Mean (sd) : 0 (5) min < med < max: -15.5 < 0 < 12.1 IQR (CV) : 7 (672.1) 274 distinct values 0 (0%)
netrtg [numeric] Mean (sd) : 0 (4.7) min < med < max: -15.1 < 0.1 < 11.4 IQR (CV) : 6.8 (420.7) 269 distinct values 0 (0%)
ratioASTtoTO [numeric] Mean (sd) : 1.5 (0.2) min < med < max: 1 < 1.5 < 2.1 IQR (CV) : 0.3 (0.1) 151 distinct values 0 (0%)
ratioAST [numeric] Mean (sd) : 16.8 (1.2) min < med < max: 14.1 < 16.7 < 21.2 IQR (CV) : 1.5 (0.1) 106 distinct values 0 (0%)
paceE [numeric] Mean (sd) : 95.7 (3.5) min < med < max: 88.6 < 95.3 < 106.5 IQR (CV) : 4.9 (0) 227 distinct values 0 (0%)
pace [numeric] Mean (sd) : 94.3 (3.4) min < med < max: 87.4 < 93.9 < 104.6 IQR (CV) : 4.8 (0) 432 distinct values 0 (0%)
ratioPIE [numeric] Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0.1) 211 distinct values 0 (0%)
year [integer] Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 7.5 (0) 16 distinct values 0 (0%)

Team_shooting contains each team’s shooting performance from different regions of the court. We cleaned it the same way as Team_splits.

* Scroll down the table to see more details

print(dfSummary(Team_shooting,
                headings = FALSE,
                plain.ascii = FALSE,
                valid.col = FALSE,
                graph.magnif = 0.75,
                style = "grid",
                max.distinct.values = 5,
                varnumbers = FALSE),
      max.tbl.height = 500,method='render')
Variable Stats / Values Freqs (% of Valid) Graph Missing
team [character] 1. Atlanta Hawks: 80 (3.3%); 2. Boston Celtics: 80 (3.3%); 3. Chicago Bulls: 80 (3.3%); 4. Cleveland Cavaliers: 80 (3.3%); 5. Dallas Mavericks: 80 (3.3%); [ 31 others ]: 1995 (83.3%) 0 (0%)
distance [character] 1. 16-24 ft.: 479 (20.0%); 2. 24+ ft.: 479 (20.0%); 3. 8-16 ft.: 479 (20.0%); 4. Back Court Shot: 479 (20.0%); 5. Less Than 8 ft.: 479 (20.0%) 0 (0%)
fgm [numeric] Mean (sd) : 607.1 (528.8) min < med < max: 0 < 474 < 2259 IQR (CV) : 467.5 (0.9) 939 distinct values 0 (0%)
fga [numeric] Mean (sd) : 1335.9 (949.2) min < med < max: 3 < 1230 < 3891 IQR (CV) : 1225 (0.7) 1309 distinct values 0 (0%)
pctFG [numeric] Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.4 < 0.6 IQR (CV) : 0.1 (0.5) 293 distinct values 0 (0%)
year [integer] Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) 16 distinct values 0 (0%)

* To understand the meaning of all variables, please visit StatsNBA.



IV. Missing values


As the tables above show, Team_shooting has no missing values, and Team_splits is complete except for pfd, which is missing in 32 rows (6.68%). Since Player and Players_bio are similar to each other, we display only the missing values of Players_bio here.

library(extracat)  # provides visna()
visna(Players_bio)
Figure: Missing values


The first row in the figure above shows that the majority of the data has no missing values.

Rows that are missing year_start are also missing all the variables that follow it. This is because these columns come from another table, bio. Although the bio table itself has no missing values, it does not contain all the players that the Player data has.

We can also see that quite a few rows are missing 3P%, FT%, and similar values. These variables are players’ shooting percentages per season. A missing value means that the player did not attempt that type of shot that season.
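The effect of joining against an incomplete bio table can be reproduced with a small left join. The tables and values below are made up, and base `merge()` stands in for whichever join function the cleaning script uses.

```r
# Made-up player-season rows and an incomplete bio table
player <- data.frame(Player = c("Player A", "Player B", "Player C"),
                     Year   = c(2019, 2019, 2019),
                     stringsAsFactors = FALSE)
bio <- data.frame(Player     = c("Player A", "Player B"),
                  year_start = c(2010L, 2015L),
                  height     = c(198.1, 210.8),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of `player` (a left join); players
# absent from `bio` get NA in every bio column at once
joined <- merge(player, bio, by = "Player", all.x = TRUE)

# Players without a bio record
unmatched <- joined[is.na(joined$year_start), "Player"]
print(unmatched)
```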
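A likely mechanism, stated here as an assumption, is that a percentage is made shots divided by attempts and is therefore undefined when there are no attempts. This matches the summary above, where 3PA has no missing values but 3P% does. In R:

```r
# Made-up made/attempted counts for three players
made      <- c(2, 0, 0)
attempted <- c(5, 3, 0)

pct <- made / attempted  # third value is 0/0, which R evaluates to NaN
is.na(pct)               # NaN counts as missing: FALSE FALSE TRUE
```

A player who attempted shots and missed all of them gets 0, not a missing value, so the two cases stay distinguishable.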



V. Results


Age

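library(dplyr)
library(tidyr)     # drop_na(), gather()
library(ggplot2)
library(ggridges)  # stat_density_ridges()

data <- Players_bio%>%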
  filter(Year>=2004)%>%
  select(Player,Age,Year)%>%
  distinct()%>%
  as.data.frame(stringsAsFactors = F)%>%
  select(Age,Year)
# ggplot(data,aes(x=as.factor(Year),y=Age))+geom_boxplot()

ggplot(data, aes(x=Age, y=Year,group=Year)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2, fill="grey80") +
  geom_text(data=data %>% group_by(Year) %>% 
              summarise(Age=median(Age)),
            aes(label=sprintf("%1.1f", Age)), 
            position=position_nudge(y=-0.1), colour="#17408B", size=3)+
  geom_text(data=data %>% group_by(Year) %>% 
              summarise(Age=min(Age)),
            aes(label=sprintf("%1.1f", Age)), 
            position=position_nudge(y=-0.1), colour="#17408B", size=3)+
  geom_text(data=data %>% group_by(Year) %>% 
              summarise(Age=max(Age)),
            aes(label=sprintf("%1.1f", Age)), 
            position=position_nudge(y=-0.1), colour="#17408B", size=3)+
  xlab("")+
  ylab("")+
  ggtitle("Age Distribution")+
  scale_y_continuous(breaks=seq(2004,2019))+
  scale_x_continuous(breaks=seq(15,45,3))+
  theme_minimal()+
  theme(
    plot.title = element_text(size=17.5,face="bold"),
    axis.text.x = element_text(color = "#000000", size = 11),
    axis.text.y = element_text(color = "#000000", size = 11))

The ridge plot presents the distribution of NBA players’ ages by year. The x-axis is age, the y-axis is the year, and the height of each curve is the estimated density at that age.

The three numbers on each distribution are the minimum, median, and maximum age (from left to right) of that season.

As we can see from the plot, the age distribution has not changed greatly: most players are between 20 and 35. The median age changes slightly, from 26 to 25.

We can also notice a jump in the minimum age between 2006 and 2007. The minimum age before 2006 is 18, while the minimum age after 2006 is 19. This is because, in 2006, the NBA raised the draft-eligible age from 18 to 19.

Another noticeable point is the maximum age, which rises and falls several times. Each increase is mainly driven by a single player. Take the increase from 2015 to 2019 as an example: the oldest player is Vince Carter, one of the oldest players in NBA history. Such players keep playing for several reasons: they are still able to contribute, they have not suffered serious injuries, etc.
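The three labels correspond to a per-season summary like the one below. This is a base R sketch on made-up ages. The plotting code computes the same quantities with `group_by()` and `summarise()`.

```r
# Made-up ages for two seasons
ages <- data.frame(Year = c(2018, 2018, 2018, 2019, 2019, 2019),
                   Age  = c(19, 26, 41, 19, 25, 42))

# One row per season with the minimum, median, and maximum age;
# do.call(data.frame, ...) flattens the matrix column aggregate() returns
age_summary <- do.call(data.frame, aggregate(
  Age ~ Year, data = ages,
  FUN = function(a) c(min = min(a), median = median(a), max = max(a))
))
print(age_summary)
```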


Height/Weight Ratio

data <- Players_bio%>%
  filter(Year>=2004)%>%
  select(Player,height,weight,Year)%>%
  distinct()%>%
  drop_na()%>%
  as.data.frame(stringsAsFactors = F)%>%
  select(height,weight,Year)%>%
  dplyr::group_by(Year)%>%
  dplyr::summarise(avg_h=mean(height),avg_w=mean(weight))%>%
  dplyr::ungroup()%>%
  mutate(hw_ratio=avg_h/avg_w)%>%
  select(Year,hw_ratio)


ggplot()+
  geom_line(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=2)+
  geom_point(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=4)+
  geom_point(aes(x=Year,y=hw_ratio),data=data,color="white",size=2)+
  scale_x_continuous(breaks=seq(2004,2019))+
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  xlab('') +
  ylab('') +
  theme_minimal()+
  ggtitle("Average Height/Weight Ratio Per Season")+
  theme(plot.title = element_text(size=17.5,face="bold"),
        axis.text.x = element_text(angle = 45, hjust = 1,color = "#000000", size = 11),
        axis.text.y = element_text(color = "#000000", size = 11))

This plot presents the average height/weight ratio of the players. It reflects players’ body shape. There is a clear increasing trend of this ratio after 2011.

While the average height and average weight have not changed much over the past 15 years (as the following table shows), the increase in the height/weight ratio means that weight is decreasing relative to height, which suggests that players are becoming more agile and faster.

data <- Players_bio%>%
  filter(Year>=2004)%>%
  select(Player,height,weight,Year)%>%
  distinct()%>%
  drop_na()%>%
  as.data.frame(stringsAsFactors = F)%>%
  select(height,weight,Year)%>%
  dplyr::group_by(Year)%>%
  dplyr::summarise(avg_h=round(mean(height),1),avg_w=round(mean(weight),1))%>%
  dplyr::ungroup()%>%
  column_to_rownames(var="Year")%>%
  t()%>%
  as.data.frame(stringsAsFactors = FALSE)
pander(data)
Table continues below
        2004    2005    2006    2007    2008    2009    2010    2011
avg_h   201     201.1   200.8   200.6   200.7   201     200.8   201.1
avg_w   220.5   221.2   220.8   220.7   220.3   221.4   221.6   223.3

        2012    2013    2014    2015    2016    2017    2018    2019
avg_h   200.7   200.8   200.7   200.7   200.9   200.8   200.5   200.7
avg_w   222.4   222.5   221.5   221.2   221.2   219.6   217.3   217.9

* The first row is the average height (cm); the second row is the average weight (pounds).
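As a worked check, the ratio can be recomputed from the rounded averages in the table above for three seasons. Because the inputs are rounded, the results may differ slightly from the plotted values. The ratio mixes centimetres and pounds, so only its change over time is meaningful, not its absolute level.

```r
# Rounded seasonal averages taken from the table above
avg <- data.frame(
  Year  = c(2004, 2011, 2019),
  avg_h = c(201.0, 201.1, 200.7),  # cm
  avg_w = c(220.5, 223.3, 217.9)   # pounds
)

avg$hw_ratio <- avg$avg_h / avg$avg_w
round(avg$hw_ratio, 3)  # 0.912 0.901 0.921
```

Height barely moves across the three seasons, so the rise after 2011 comes almost entirely from the drop in average weight.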


Overall Offensive Performance

Team_splits %>% select(year, pts, pace) %>% group_by(year) %>% summarise(Pace = mean(pace), Points = mean(pts)) %>%
  gather(key = 'type', value = 'value', -year) %>%
  ggplot(aes(x = year, y = value)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  #scale_color_manual(values = c('#17408B', '#C9082A')) +
  facet_grid(type ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  ggtitle('Pace and Points Per Game') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11), 
        axis.text.y = element_text(color = "#000000", size = 11), 
        strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none')

There is an obvious trend in both Pace (the number of possessions a team uses per game) and PPG (Points Per Game) in NBA games over the last 15 years. The more possessions a team accumulates, the quicker the pace of the game.

We can see that from 2004 to 2013 pace and PPG fluctuate around 93 and 98 respectively, but from 2014 both keep growing. In 2019 in particular, pace rises to 101 from 98 the year before, and PPG increases by nearly 6 points over the previous season. It is easy to see a positive association between pace and PPG: the more possessions a team has, the more chances it has to score, although its opponents also get more chances.

* The formula for pace is: 48 x ((Tm Poss + Opp Poss) / (2 x (Tm MP / 5))). The numerator sums Team Possessions and Opponent’s Possessions. The denominator uses Team Minutes Played, which is the total number of minutes played by all players on the team. The factor of 48 scales the result to a regulation-length game. StatNBA
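A small sketch of the pace calculation on made-up inputs. It assumes the per-48-minute scaling of the commonly cited Basketball-Reference definition, and the function and argument names are hypothetical.

```r
# Pace: possessions per 48 minutes, averaged over both teams.
# tm_mp is the total minutes played by all players on the team
# (5 players x 48 minutes = 240 for a regulation game).
pace <- function(tm_poss, opp_poss, tm_mp) {
  48 * (tm_poss + opp_poss) / (2 * (tm_mp / 5))
}

pace(tm_poss = 100, opp_poss = 98, tm_mp = 240)  # 99
```

For a regulation game the scaling cancels out and pace is just the average of the two teams’ possessions. The minutes term only matters when a game goes to overtime.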

Team_splits %>% select(year, ortg, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=ortg, alpha = pctWins, color = pctWins)) +
  geom_jitter(size = 2) +
  scale_colour_gradient(low = "#8ec5ff", high = "#19293a",breaks
=c(0.2,0.4,0.6,0.8), labels=c("20%","40%","60%","80%"))+
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2, show.legend = F) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  guides(alpha=FALSE)+
  ggtitle('Average Offensive Rating Per Game') +
  xlab('') +
  ylab('') +
  labs(colour="% Win")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11),
        axis.text.y = element_text(color = "#000000", size = 11),
        plot.title = element_text(size = 17.5, face = 'bold')) 

This plot shows the offrtg (offensive rating, a statistic used to measure a team’s offensive performance) of each team in each of these 15 seasons. The color reflects the Win Percentage of each team: the darker the marker, the more the team wins.

The dashed line is a fitted trend line. Teams with a higher offensive rating (points above the dashed line) tend to have a higher Win Percentage (darker points).

Offensive Rating shows that the offensive ability of teams started growing in 2013 and reached an unprecedented level in 2018. Are there reasons other than the higher pace for such strong offensive performance in recent years?

* offrtg = 100 x (Points / POSS). It measures a team’s points scored per 100 possessions. On a player level, this statistic is team points scored per 100 possessions while he is on court. StatNBA
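The offensive rating formula, and its link to pace and scoring, can be sketched in a few lines of R. The function names and the input numbers are hypothetical and chosen only for illustration; the second function is an approximation that ignores overtime.

```r
# Offensive rating: points scored per 100 possessions.
off_rating <- function(points, poss) {
  100 * points / poss
}

# Approximate points per game implied by an offensive rating and a pace,
# assuming pace is possessions per game.
ppg_from <- function(ortg, pace) {
  ortg * pace / 100
}

off_rating(points = 110, poss = 100)  # 110
ppg_from(ortg = 110, pace = 100)      # 110
```

Because offensive rating is already normalized per possession, growth in it reflects more efficient scoring, not simply more possessions.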


Scoring

p1 <- Team_splits %>% select(year, fg3a, fg2a) %>%
  gather(key = 'type', value = 'attempt', -c(year)) %>%
  group_by(year, type) %>% summarise(attempt = mean(attempt)) %>%
  ggplot(aes(x = year, y = attempt, group = year)) +
  #geom_boxplot(aes(color = type)) +
  #geom_line() +
  geom_bar(stat = 'identity', fill = '#C9082A') +
  facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`fg2a` = '2 pointer', `fg3a` = '3 pointer'))) +
  scale_color_manual(values = c('#17408B', '#C9082A')) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  #ylim(0, 2500) +
  ggtitle('Field Goals Attempt') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 15),
        axis.text.y = element_text(color = "#000000", size = 15),
        strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'),
        legend.position = 'none') 

p2 <- Team_splits %>% select(year, pctFG3, pctFG2) %>% 
  gather(key = 'type', value = 'percentage', -c(year)) %>%
  group_by(year, type) %>% summarise(percentage = mean(percentage)) %>%
  ggplot(aes(x = year, y = percentage)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`pctFG2` = '2 pointer', `pctFG3` = '3 pointer'))) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  xlab('') +
  ylab('') +
  ggtitle('Field Goals Percentage') +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 15),
        axis.text.y = element_text(color = "#000000", size = 15),
        strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

grid.arrange(p1, p2, ncol = 2)

In basketball, a field goal is a basket scored on any shot or tap other than a free throw, worth two or three points depending on the distance of the attempt from the basket. An attempt is counted whether or not the shot is made.

This plot shows the league-average (per team, per game) FGA (Field Goal Attempts) and FG% (Field Goal Percentage) for both 2 pointers and 3 pointers. Please note that each FG% is the share of made shots among all attempts of that type, so the 3 pointer FG% and the 2 pointer FG% do not necessarily add up to 100%.

In the left plot, teams attempt more and more 3 pointers year by year without cutting 2 pointer attempts by much. In 2019, FGA for 3 is more than twice what it was 15 years earlier. Also in 2019, FGA for 3 is above 30 and FGA for 2 is below 60, which means that on average about one of every three shots in an NBA game in 2019 is a 3 pointer.
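The "one in three" figure follows from simple arithmetic on the attempt counts. The sketch below uses the boundary values 30 and 60 quoted in the text, not exact values from the dataset, and the function name is hypothetical.

```r
# Share of all field goal attempts that are 3 pointers.
three_point_share <- function(fg3a, fg2a) {
  fg3a / (fg3a + fg2a)
}

# Boundary case from the text: 30 three-point and 60 two-point attempts.
three_point_share(fg3a = 30, fg2a = 60)  # one third
```

Since the actual 2019 values are above 30 and below 60 respectively, the true share is slightly higher than one third.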

The right plot shows the FG% of 2 pointers and 3 pointers from 2004 to 2019. The FG% for 2 has kept growing since 2012 and has stayed above 50% since 2017. The FG% for 3 fluctuates between 35% and 36% in most years. Teams are making their 2 pointer shots more efficient by raising the FG%.

From these two plots, we can see that the strategy of NBA teams to score more is to attempt more 3 pointers and make 2 pointer shots more efficient.

Team_shooting$distance <- factor(Team_shooting$distance, levels = unique(Team_shooting$distance))

p1 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, fga, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
  ggplot(aes(x = year, y = fga/82, group = year)) +
  #geom_boxplot() +
  geom_bar(stat = 'identity', fill = '#C9082A') +
  facet_grid(distance ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  xlab('') +
  ylab('') +
  #ylim(0,1500) +
  ggtitle('Field Goals Attempt by Distance') +
  theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 15), 
          axis.text.y = element_text(color = "#000000", size = 15),
          legend.position = 'none',
          strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
          strip.background = element_rect(fill = '#17408B', colour = 'white'),
          plot.title = element_text(size = 17.5, face = 'bold'))

p2 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, pctFG, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
  ggplot(aes(x = year, y = pctFG)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  facet_grid(distance ~ ., scales = 'free_y') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  xlab('') +
  ylab('') +
  ggtitle('Field Goals Percentage by Distance') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 15), 
        axis.text.y = element_text(color = "#000000", size = 15),
        legend.position = 'none',
        strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

grid.arrange(p1, p2, ncol = 2)

This plot shows FGA and FG% of shots from different regions on the court. The distance is how far the shooting spot is from the basket.

Shots beyond 23 feet 9 inches from the basket are 3 pointers and the others are 2 pointers. The 24+ ft data are similar to those of the 3 pointer in the plot above. The 2 pointer shots can be decomposed into 3 types: ‘near basket’ (less than 8 ft.), ‘mid-range’ (8-16 ft.), and ‘long-range’ (16-24 ft.).
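For readers who want to reproduce this grouping from raw shot distances, a minimal sketch using base R's `cut()` is shown below. The function name `shot_zone`, the zone labels, and the choice to treat each lower boundary as inclusive are assumptions; the dataset used here already comes with the distance bins attached.

```r
# Assign a shot distance (in feet) to one of the four regions described above.
# Intervals are closed on the left: [0, 8), [8, 16), [16, 24), [24, Inf).
shot_zone <- function(distance_ft) {
  cut(distance_ft,
      breaks = c(0, 8, 16, 24, Inf),
      labels = c("near basket", "mid-range", "long-range", "24+ ft"),
      right = FALSE)
}

# Made-up distances, one per region.
as.character(shot_zone(c(3, 10, 20, 26)))
```

Note that the 24 ft boundary only approximates the 3 pointer line at 23 feet 9 inches.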

We can see from the left plot that ‘near basket’ 2 pointer FGA is the highest of all; it reaches 30 in 2019, which is more than the other two types combined. Meanwhile ‘long-range’ attempts keep going down and ‘mid-range’ attempts remain around 12. Since the difficulty of making a field goal rises with the distance from the basket, ‘long-range’ shots seem to be less valuable than ‘near basket’ ones. In the right plot, the FG% of ‘near basket’ shots is far above the others and reaches 58% in 2019, while the FG% of ‘mid-range’ shots also keeps rising.

NBA teams keep attempting more 3 pointers while raising the FG% of 2 pointers. They take fewer shots from ‘low efficiency’ regions and focus more on shots near the basket.

Team_splits %>% select(year, pctTS, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=pctTS, alpha = pctWins, color = pctWins)) +
  geom_jitter(size = 2) +
  scale_colour_gradient(low = "#8ec5ff", high = "#19293a",
                        breaks = c(0.2, 0.4, 0.6, 0.8), labels = c("20%", "40%", "60%", "80%")) +
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2, show.legend = F) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = 'none') +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  ggtitle('Average True Shooting Percentage Per Game') +
  guides(alpha=FALSE)+
  xlab('') +
  ylab('') +
  labs(colour="% Win")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11),
        axis.text.y = element_text(color = "#000000", size = 11),
        plot.title = element_text(size = 17.5, face = 'bold'))

TS% (True Shooting Percentage, a measure of shooting efficiency) combines field goal percentage, free throw percentage, and 3 pointer field goal percentage into a single number instead of taking them individually, which gives a more accurate picture of shooting. As before, the darker the marker, the more the team wins.

The TS% curve has a shape similar to that of the offrtg curve, and teams now shoot much more efficiently than 15 years ago.

* TS% = Points / [2 x (Field Goals Attempted + 0.44 x Free Throws Attempted)]. This is a shooting percentage that factors in the value of three-point field goals and free throws in addition to conventional two-point field goals. StatNBA
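A short R sketch makes the TS% formula concrete. The function name and the box-score numbers are hypothetical and used only to illustrate the arithmetic.

```r
# True Shooting Percentage from points, field goal attempts and free throw attempts.
true_shooting <- function(points, fga, fta) {
  points / (2 * (fga + 0.44 * fta))
}

# Made-up team game: 110 points on 88 field goal attempts and 25 free throw attempts.
true_shooting(points = 110, fga = 88, fta = 25)

# Made-up shooter who only takes 3 pointers and makes 4 of 10:
# FG% is 40%, but TS% credits the extra point per make.
true_shooting(points = 12, fga = 10, fta = 0)  # 0.6
```

The second call shows why TS% and FG% can diverge: a 40% three-point shooter is as efficient as a 60% two-point shooter.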


Sharing the ball

Teamwork is very important in basketball because it lets the team function as a unit rather than as individuals. On offense, teamwork is vital because it keeps the defense guessing about who will take the shot and where the shot will come from. If only one person takes the shots for the team, the defense can concentrate its efforts on stopping that scorer.

Team_splits %>% select(year, ast, tov) %>% group_by(year) %>% summarise_all(mean) %>%
  gather(key = 'type', value = 'value', -year) %>%
  ggplot(aes(x = year, y=value)) +
  geom_line(color = '#C9082A', size = 2) +
  geom_point(color = '#C9082A', size = 4) +
  geom_point(color = '#FFFFFF', size = 2) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  ggtitle('Average Assists and Turnovers Per Game') +
  facet_grid(type ~ ., scales = 'free_y') +
  xlab('') +
  ylab('') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11), 
        axis.text.y = element_text(color = "#000000", size = 11),
        legend.position = 'none',
        strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
        strip.background = element_rect(fill = '#17408B', colour = 'white'),
        plot.title = element_text(size = 17.5, face = 'bold'))

Ast (Assist, attributed to a player who passes the ball to a teammate in a way that leads to a score by field goal) roughly measures the willingness and ability of a team to share the ball, and tov (Turnover, which occurs when a team loses possession of the ball to the opposing team before a player takes a shot at their team’s basket) gives an angle on how disciplined the team is.

In this plot, assists rise dramatically in 2013 and keep growing afterwards, while turnovers rise until 2014 and have been dropping since. It’s likely that in 2013 and 2014 teams started to speed up and encourage passing before players had adjusted to this style, so a lot of passes turned into turnovers. From 2015, teams began to figure out how to get the ball to the scorer and reduce bad passes.

Team_splits %>% select(year, ratioAST, pctWins) %>% group_by(year) %>%
  ggplot(aes(x = year, y=ratioAST/100, alpha = pctWins, color = pctWins)) +
  geom_jitter(size = 2) +
  scale_colour_gradient(low = "#8ec5ff", high = "#19293a",
                        breaks = c(0.2, 0.4, 0.6, 0.8), labels = c("20%", "40%", "60%", "80%")) +
  geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2, show.legend = F) +
  scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
  guides(alpha=FALSE)+
  ggtitle('Assist Ratio Per Game') +
  xlab('') +
  ylab('') +
  labs(colour="% Win")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11), 
        axis.text.y = element_text(color = "#000000", size = 11),
        #legend.position = 'none',
        plot.title = element_text(size = 17.5, face = 'bold'))

Assist Ratio is the percentage of a team’s possessions that end in an assist. The growth is obvious in the most recent three years, and the teams that are best at sharing the ball in these years all have outstanding win percentages.

* Assist Ratio = (AST * 100) / (POSS). StatNBA
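The Assist Ratio formula can be sketched in R as follows. The function name and the numbers are hypothetical; the division by 100 in the second call mirrors the `ratioAST/100` rescaling in the plotting code above, which puts the value on a 0 to 1 scale for percent formatting.

```r
# Assist Ratio: assists per 100 possessions.
assist_ratio <- function(ast, poss) {
  (ast * 100) / poss
}

# Made-up team game: 25 assists on 100 possessions.
assist_ratio(ast = 25, poss = 100)        # 25
assist_ratio(ast = 25, poss = 100) / 100  # 0.25, the scale used for the y axis
```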



VI. Interactive Plot


The following interactive plots are created in a shiny app. The link to our shiny app is https://cy2507.shinyapps.io/NBA_15years/. You may click on the link to play with the interactive components.

Scoring

This part analyzes changes in players’ scoring methods and efficiency during the last 15 years in the NBA. Definitions of the terms we mention are shown on the right.

We choose players who appear in more than 30 games in a season (82 games in total), play more than 15 minutes per game (48 minutes in total), and average at least 0.1 3-pointer attempts.
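The selection rule can be written as a simple filter. This is a minimal base R sketch under stated assumptions: the data frame, its rows, and the column names `gp` (games played), `mpg` (minutes per game) and `fg3a` (3-pointer attempts) are hypothetical and do not necessarily match the columns of the players datasets or the code inside the Shiny app.

```r
# Toy player table with made-up values.
players <- data.frame(
  player = c("A", "B", "C", "D"),
  gp     = c(82, 25, 60, 45),       # games played in the season
  mpg    = c(34.1, 20.0, 12.5, 18.0),  # minutes per game
  fg3a   = c(6.2, 1.0, 2.4, 0.0),   # 3-pointer attempts
  stringsAsFactors = FALSE
)

# Keep players with more than 30 games, more than 15 minutes per game,
# and at least 0.1 3-pointer attempts.
eligible <- subset(players, gp > 30 & mpg > 15 & fg3a >= 0.1)
eligible$player  # only "A" passes all three conditions
```

Dropping the `fg3a` condition gives the looser rule used for the ball-sharing plots.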

You can also click the buttons at the top left of the plot to show only the position you are interested in.

Scoring-Front page

Field Goal Attempts

This plot uses 3-pointer and 2-pointer FGA (Field Goal Attempts) as the axes and total FGA as the marker size to show the scoring methods of teams in different seasons.

From the plot, we can see a noticeable upward trend in 3-pointer FGA starting from 2012.


Field Goal Percentage

The plot uses 3-pointer and 2-pointer FG% (Field Goal Percentage) as the axes and total FGA as the marker size to show the scoring efficiency of teams in different seasons.

The 3-pointer field goal percentage stays roughly the same while the 2-pointer field goal percentage is increasing. Part of the strategy of NBA teams to score more is to make 2-pointer shots more efficient.


3-pointer Performance

The plot uses 3-pointer and 3-pointer percentage as the axes and total FGA as the marker size to show the 3-pointer performance of teams in different seasons.

The 3-pointer field goal percentage is around 30% to 40%. The strategy of NBA teams to score more is to attempt more 3-pointers.


2-pointer Performance

The plot uses 2-pointer and 2-pointer percentage as the axes and total FGA as the marker size to show the 2-pointer performance of teams in different seasons.

The 2-pointer field goal attempts are decreasing while the 2-pointer field goal percentage is increasing.


True Shooting Percentage

The plot uses TS% (True Shooting Percentage) and FG (Field Goals) as the axes and FGA (Field Goal Attempts) as the marker size to show the shooting ability of teams in different seasons.

TS% has increased by around 10% over the past 15 years, which is a large improvement in shooting ability. Among all the positions, bigs have improved the most.


Sharing Ball

This part analyzes changes in players’ willingness and ability to share the ball during the last 15 years in the NBA. Definitions of the terms we mention are shown on the right.

We choose players who appear in more than 30 games in a season (82 games in total) and play more than 15 minutes per game (48 minutes in total).

You can also click the buttons at the top left of the plot to show only the position you are interested in.

Sharing Ball-Front page

Assist VS Turnover

The plot uses Assist and Turnover as the axes and USG% (Usage rate) as the marker size to show the ball-sharing ability of teams in different seasons.

As the pace of the game increases, Assists and Turnovers both increase. Players tend to share the ball and assist others more.

* Usage rate, a.k.a. usage percentage, is an estimate of the percentage of team plays used by a player while he was on the floor. By balancing usage rates and the varying offensive ratings of the five players on the court, a team can achieve optimal offensive output.

Ast% VS Tov%

The plot uses Ast% and Tov% as the axes and USG% as the marker size to show the ball-sharing ability of teams in different seasons.

Although turnovers appear to increase in the previous plot, the turnover percentage actually stays about the same and even decreases a little.

Another noticeable point is that the ast%/tov% of the wing positions is increasing. It suggests they are playing a more important role in organizing the offense.



VII. Conclusion


Overview of the Result

  • The age distribution over the past 15 years is almost unchanged, while the minimum age has increased by 1 year because the NBA raised the draft-eligible age from 18 to 19.
  • The increase in the average height/weight ratio suggests players are becoming more agile and faster.
  • Pace and Points Per Game show a similar increasing trend. A higher Pace means a more active offense and more chances to score.
  • The strategy of NBA teams to score more is to attempt more 3 pointers and make 2 pointer shots more efficient.
  • NBA teams keep attempting more 3 pointers while raising the FG% of 2 pointers. They take fewer shots from ‘low efficiency’ regions and focus more on shots near the basket.
  • Assists have risen dramatically since 2013, while turnovers rose until 2014 and have been dropping since. This may be because in 2013 and 2014 teams started to speed up and encourage passing before players had adjusted to this style, so plenty of passes turned into turnovers. From 2015, teams began to figure out how to get the ball to the scorer and reduce bad passes.


Limitations

  • Focusing only on the last 15 years of data may miss some long-term trends. For example, height and weight data are quite stable in recent years; the change is only noticeable over a 60-year range.
  • We focused more on offense data than on defense data. This is partly because defense is harder to measure.
  • As for offense data, we used mostly shooting data rather than passing data.
  • Since there are too many variables, we only used a limited number of them. We may explore more variables in the future.


Future Directions

  • Explore changes on the defensive side.
  • Explore more advanced statistics such as Shot Charts, On/Off Splits, Lineup Data, etc.
  • Explore more detailed data, such as studying a particular game rather than a whole season.
  • Explore spatial data, such as which particular spots on the court have a higher hit rate.



 

A work by Chao Yin & Zeyu Yang

cy2507@columbia.edu | zy2327@columbia.edu